# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Mon Dec 11 23:52:21 2023"
I have really enjoyed the start of this course. As a PhD student, I feel that when starting my PhD I had to dive into the deep end of R, beginning quite early on with, for example, single-cell analysis and WES and WGS data, so I did not really have much time to focus on the basics.
I liked the set-up of the exercise set; it is user-friendly, and it was nice to go back to the basics. Some of it was quite repetitive, and there are some types of plots I won't ever use in my own research, but I guess they are good to know in general.
I liked the trick of not leaving `install.packages()` calls in the code but rather commenting them out and running them selectively when installation is needed; this was useful.
I didn’t really understand the examples with the dates.
`pivot_longer()` was also useful; I often struggle with this and end up googling how to do it again.
I hope I will get back to the basics with this course, and I feel the exercise active workbook is a good resource for later on when you forget how to do something.
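As a note to self, a minimal `pivot_longer()` call looks like this (the data frame here is made up for illustration):

```r
library(tidyr)

# Made-up wide-format data: one row per student, one column per test
wide <- data.frame(student = c("A", "B"),
                   test1 = c(10, 20),
                   test2 = c(30, 40))

# Pivot the test columns into long format: one row per student-test pair
long <- pivot_longer(wide, cols = c(test1, test2),
                     names_to = "test", values_to = "score")
long
```

The `names_to` column collects the old column names and `values_to` collects their values, giving 4 rows here instead of 2.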
My GitHub repository: https://github.com/nygr/IODS-project

***
# Petra Nygren
date()
## [1] "Mon Dec 11 23:52:21 2023"
Read in data needed for exercise
# set working directory and read in previously created csv file
setwd("/Users/nygrpetr/Desktop/iods/iods_petra/")
my_data <- read.csv("data/learning2014.csv")
The dataset we are using in this exercise consists of learning data from students. It contains the students' gender (male or female) and age (numerical). The variables "deep", "strategic" and "surface" describe the study approach a student used (deep, strategic or surface learning); the values are means over the related questionnaire items, and observations with zero exam points have been excluded. "Points" gives the exam points received by each student, and "attitude" is a numerical score for the student's attitude towards statistics. The dataset contains variables of integer, numerical and character classes.
# explore dimensions dim() (number of rows and columns) and structure str() of the data
dim(my_data)
## [1] 166 8
str(my_data)
## 'data.frame': 166 obs. of 8 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ gender : chr "F" "M" "F" "M" ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude : int 37 31 25 35 37 38 35 29 38 21 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ surface : num 2.58 3.17 2.25 2.25 2.83 ...
## $ strategic: num 3.38 2.75 3.62 3.12 3.62 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
#Remove extra variable "X"
my_data$X <- NULL
Besides the age of the subjects, the other variables appear to have roughly normal distributions. For age, most subjects are between 20 and 25 years; the distribution is right-skewed, with a median age of 22 years. Attitude and points are the most clearly normally distributed and symmetrical, whereas deep and surface have slight skews in their distributions.
Exam points and attitude have the clearest positive correlation: higher attitude scores seem to go together with higher exam points.
# install (if not yet installed) and load library
# install.packages("ggplot2")
# install.packages("GGally")
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
# Plot some histograms to check the distributions of the data
hist(my_data$age)
hist(my_data$attitude)
hist(my_data$deep)
hist(my_data$surface)
hist(my_data$strategic)
hist(my_data$points)
#See summary of the data
summary(my_data)
## gender age attitude deep
## Length:166 Min. :17.00 Min. :14.00 Min. :1.583
## Class :character 1st Qu.:21.00 1st Qu.:26.00 1st Qu.:3.333
## Mode :character Median :22.00 Median :32.00 Median :3.667
## Mean :25.51 Mean :31.43 Mean :3.680
## 3rd Qu.:27.00 3rd Qu.:37.00 3rd Qu.:4.083
## Max. :55.00 Max. :50.00 Max. :4.917
## surface strategic points
## Min. :1.583 Min. :1.250 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:2.625 1st Qu.:19.00
## Median :2.833 Median :3.188 Median :23.00
## Mean :2.787 Mean :3.121 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:3.625 3rd Qu.:27.75
## Max. :4.333 Max. :5.000 Max. :33.00
#Graphical overview
# pairs() creates a matrix of scatter plots between the numerical variables. The first variable, gender, is excluded because it is non-numerical
pairs(my_data[,-1])
#Alternatively ggpairs can be used to create a plot matrix
ggpairs(my_data, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
From the graphical overview, we are able to see that attitude, strategic learning and surface learning are the top 3 variables correlating with exam points. Of these, attitude has the strongest correlation and surface learning the weakest (by visual inspection). Now we fit a linear model using these three variables as explanatory variables and points as the response variable.
# fit a model with three explanatory variables
model <- lm(points ~ attitude + surface + strategic, data = my_data)
# print out summary
summary(model)
##
## Call:
## lm(formula = points ~ attitude + surface + strategic, data = my_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.01711 3.68375 2.991 0.00322 **
## attitude 0.33952 0.05741 5.913 1.93e-08 ***
## surface -0.58607 0.80138 -0.731 0.46563
## strategic 0.85313 0.54159 1.575 0.11716
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
# Only attitude is significant, so we fit a new model with attitude alone
model_new <- lm(points ~ attitude, data = my_data)
# print out summary
summary(model_new)
##
## Call:
## lm(formula = points ~ attitude, data = my_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
Of the input variables, only attitude towards statistics had a statistically significant effect on exam points, so we ran a new model with attitude as the only explanatory variable and exam points as the response. According to this model, when attitude increases by one unit, exam points increase by about 0.35 units. R-squared measures how well the model fits the data; it ranges from 0 to 1, with higher values indicating a better fit. Here R-squared is 0.1906, so about 19% of the variation in exam points is explained by attitude; the remaining roughly 81% is due to factors not available in this dataset. My conclusion from this linear model is that students with more positive attitudes towards statistics achieve higher exam points, but attitude alone is not enough to explain exam success.
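The slope (0.35) and R-squared (0.19) quoted above can also be extracted from the model object programmatically; here is a small sketch with made-up data (not the course dataset):

```r
# Made-up data with a roughly linear relationship: y = 2x + noise
set.seed(1)
x <- 1:50
y <- 2 * x + rnorm(50, sd = 10)
fit <- lm(y ~ x)

# Slope estimate and R-squared from the fitted model
coef(fit)["x"]          # estimated change in y per one-unit increase in x
summary(fit)$r.squared  # proportion of variance in y explained by the model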
# Diagnostic plots using plot() function
par(mfrow=c(2,2))
plot(model_new)
A linear model has several assumptions that must be met for the results to be valid. It assumes that:

1. the relationship between the variables is linear,
2. the observations are independent,
3. the residuals are normally distributed, and
4. homoscedasticity holds: the residuals have constant variance at every level of the independent variable(s).

Above, we used R's plot() function to check whether our data meets these assumptions.
By default, plot() produces four diagnostic plots. The first is the Residuals vs Fitted plot, the residuals plotted against the fitted values, which is used to check for non-linear patterns in the residuals. A clear pattern would suggest that a linear regression model is not appropriate for the data. In our plot there is no clear pattern, so I think the data meets the assumption of a linear relationship.
The second plot is the QQ-plot, which is used to check for normality. If the residuals are normally distributed, the points in the QQ-plot should fall along a straight line; deviation from the line suggests the residuals are not normally distributed. In our data the points generally follow the line, so I would say the residuals are close to normally distributed. There is some deviation toward the bottom and top of the plot, but I would allow this.
The third plot is the Scale-Location plot, which is used to check for homoscedasticity, i.e. that the variance of the residuals is constant across all levels of the predictor variables. It plots the square root of the standardized residuals against the fitted values, with a red trend line; if the red line is roughly horizontal, the residuals have roughly constant variance. In our plot I would interpret the red line as mostly horizontal, so we can assume the data is homoscedastic.
The fourth plot is the Residuals vs Leverage plot, which is used to identify influential observations. An observation is influential if it has a large residual and/or a high leverage value; in general, it is influential if it has a large effect on the slope of the regression line. In the plot, Cook's distance is used to identify influential observations: an observation with a Cook's distance greater than 1 is considered influential. In our data there seem to be some influential outlier observations with a large effect on the line, so to produce a more accurate result, the outliers should be examined and the analysis repeated. However, in the interest of time I have not proceeded with those steps.
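Cook's distances can also be checked numerically rather than visually; below is a minimal sketch on made-up data, using the same "greater than 1" rule of thumb mentioned above:

```r
# Toy regression with one deliberately influential point
x <- c(1:20, 50)
y <- c(2 * (1:20), 500)  # the last observation lies far from the 2x trend
fit <- lm(y ~ x)

# Cook's distance per observation; values > 1 flag influential points
cd <- cooks.distance(fit)
which(cd > 1)  # only the outlying observation (index 21) is flagged
```

In a real analysis one would inspect the flagged rows before deciding whether to refit without them.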
Information on the assumptions and plot types was found on 1. statology.org 2. geeksforgeeks.org 3. upgrad.com
# set working directory and read in previously created csv file
setwd("/Users/nygrpetr/Desktop/iods/iods_petra/")
alc_data <- read.csv("data/alc_data.csv")
print(colnames(alc_data))
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures" "paid" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
The dataset we are using includes data on students' achievement in secondary education at two schools in Portugal. The variables include student grades, demographic information such as mother's and father's education, and social and school-related features. The original data come from two datasets concerning different subjects: Mathematics (mat) and Portuguese language (por). We have added a column alc_use, which is the average of weekday and weekend alcohol use.
Here, we aim to study the relationships between high/low alcohol consumption and some of the other variables in the data. To do this, I chose 4 interesting variables and, for each of them, present my personal hypothesis about its relationship with alcohol consumption.
I chose to study 4 variables: sex, Pstatus (whether the parents live together or apart), Medu (mother's education) and health. I hypothesize that 1) male sex will correlate with higher alcohol use, 2) students whose parents live together will use less alcohol, 3) students whose mothers are highly educated will use less alcohol, and 4) students with better health status will use less alcohol.
Next, we will explore the distributions of the previously mentioned variables and their association with alcohol use
library(ggplot2)
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
ggplot(alc_data, aes(sex)) + geom_bar()
ggplot(alc_data, aes(Pstatus)) + geom_bar()
ggplot(alc_data, aes(Medu)) + geom_bar()
ggplot(alc_data, aes(health)) + geom_bar()
In the dataset, there are more female students than male; most of the students' parents live together; more students have highly educated mothers than mothers with little education; and more students report the highest degree of health than the lowest.
# Check what proportion of students in each category (sex: M/F, Pstatus: T/A, Medu: 0-4, health: 1-5) belongs to the high alcohol use group
alc_data %>% group_by(sex, high_use) %>% summarise(count = n()) %>% mutate(prop = count / sum(count))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
## # A tibble: 4 × 4
## # Groups: sex [2]
## sex high_use count prop
## <chr> <lgl> <int> <dbl>
## 1 F FALSE 154 0.790
## 2 F TRUE 41 0.210
## 3 M FALSE 105 0.6
## 4 M TRUE 70 0.4
alc_data %>% group_by(Pstatus, high_use) %>% summarise(count = n()) %>% mutate(prop = count / sum(count))
## `summarise()` has grouped output by 'Pstatus'. You can override using the
## `.groups` argument.
## # A tibble: 4 × 4
## # Groups: Pstatus [2]
## Pstatus high_use count prop
## <chr> <lgl> <int> <dbl>
## 1 A FALSE 26 0.684
## 2 A TRUE 12 0.316
## 3 T FALSE 233 0.702
## 4 T TRUE 99 0.298
alc_data %>% group_by(Medu, high_use) %>% summarise(count = n()) %>% mutate(prop = count / sum(count))
## `summarise()` has grouped output by 'Medu'. You can override using the
## `.groups` argument.
## # A tibble: 10 × 4
## # Groups: Medu [5]
## Medu high_use count prop
## <int> <lgl> <int> <dbl>
## 1 0 FALSE 1 0.333
## 2 0 TRUE 2 0.667
## 3 1 FALSE 31 0.633
## 4 1 TRUE 18 0.367
## 5 2 FALSE 78 0.812
## 6 2 TRUE 18 0.188
## 7 3 FALSE 57 0.613
## 8 3 TRUE 36 0.387
## 9 4 FALSE 92 0.713
## 10 4 TRUE 37 0.287
alc_data %>% group_by(health, high_use) %>% summarise(count = n()) %>% mutate(prop = count / sum(count))
## `summarise()` has grouped output by 'health'. You can override using the
## `.groups` argument.
## # A tibble: 10 × 4
## # Groups: health [5]
## health high_use count prop
## <int> <lgl> <int> <dbl>
## 1 1 FALSE 35 0.761
## 2 1 TRUE 11 0.239
## 3 2 FALSE 28 0.667
## 4 2 TRUE 14 0.333
## 5 3 FALSE 61 0.762
## 6 3 TRUE 19 0.238
## 7 4 FALSE 45 0.726
## 8 4 TRUE 17 0.274
## 9 5 FALSE 90 0.643
## 10 5 TRUE 50 0.357
ggplot(data = alc_data, aes(x = sex, fill = high_use)) +
geom_bar(position = "dodge")
ggplot(data = alc_data, aes(x = Pstatus, fill = high_use)) +
geom_bar(position = "dodge")
ggplot(data = alc_data, aes(x = Medu, fill = high_use)) +
geom_bar(position = "dodge")
ggplot(data = alc_data, aes(x = health, fill = high_use)) +
geom_bar(position = "dodge")
My prediction about sex and alcohol use seems to have some truth behind it. With Pstatus there seem to be roughly equal proportions of high and low alcohol use among children of parents who cohabit and those who do not. Contrary to my predictions, based on visual inspection of the data, students with more highly educated mothers seem to drink more alcohol, and those with better health also seem to drink more, so I doubt my hypotheses concerning those variables.
Next, we will fit a logistic regression model to test the relationships between the chosen variables and high alcohol use. The ordinal variables (Medu and health) are first converted to factors so that they are treated as categorical rather than numerical.
#Turn Medu and health into factors to enable them to be used in the model
alc_data$Medu <- factor(alc_data$Medu)
alc_data$health <- factor(alc_data$health)
#Run regression model
my.mod <- glm(high_use ~ sex + Pstatus + Medu + health, data = alc_data, family = "binomial")
summary(my.mod)
##
## Call:
## glm(formula = high_use ~ sex + Pstatus + Medu + health, family = "binomial",
## data = alc_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.03050 1.35266 0.023 0.98201
## sexM 0.92338 0.24456 3.776 0.00016 ***
## PstatusT -0.09829 0.39260 -0.250 0.80232
## Medu1 -1.12776 1.29505 -0.871 0.38385
## Medu2 -2.16404 1.28249 -1.687 0.09153 .
## Medu3 -1.16803 1.27388 -0.917 0.35919
## Medu4 -1.69835 1.27132 -1.336 0.18158
## health2 0.55552 0.50095 1.109 0.26745
## health3 0.02365 0.45133 0.052 0.95821
## health4 0.28251 0.46379 0.609 0.54243
## health5 0.44817 0.40569 1.105 0.26929
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 420.13 on 359 degrees of freedom
## AIC: 442.13
##
## Number of Fisher Scoring iterations: 4
#Compute odds ratios and confidence intervals, print the results
OR <- coef(my.mod) %>% exp
CI <- confint(my.mod) %>% exp
## Waiting for profiling to be done...
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 1.0309650 0.076881229 25.754772
## sexM 2.5177759 1.566506913 4.093497
## PstatusT 0.9063878 0.427092627 2.013112
## Medu1 0.3237572 0.013860043 3.858076
## Medu2 0.1148602 0.004990072 1.336025
## Medu3 0.3109786 0.013654668 3.562161
## Medu4 0.1829843 0.008056463 2.084139
## health2 1.7428548 0.655437479 4.730204
## health3 1.0239323 0.426979687 2.535839
## health4 1.3264582 0.538881788 3.360685
## health5 1.5654372 0.722581102 3.586186
Of our 4 chosen variables, only sex had a statistically significant effect on alcohol use: male sex was associated with an increase of about 0.92 in the log-odds of high_use being TRUE. The odds ratio is a "relative measure of effect" that compares the odds between groups; an odds ratio of 1 means no association, and an effect can be considered statistically significant when the confidence interval does not contain 1. Sex is the only variable whose confidence interval excludes 1: the odds ratio for sexM is 2.52, with a 95% confidence interval of 1.57 to 4.09, meaning the odds of high alcohol use are roughly 2.5 times higher for male students.
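The odds-ratio arithmetic is just exponentiation of the log-odds coefficients. A minimal sketch on made-up counts (20/100 "high use" in group F versus 40/100 in group M; not the course data):

```r
# Made-up binary outcome by group: 20% TRUE in group F, 40% TRUE in group M
grp <- rep(c("F", "M"), each = 100)
out <- c(rep(c(TRUE, FALSE), c(20, 80)),
         rep(c(TRUE, FALSE), c(40, 60)))
fit <- glm(out ~ grp, family = "binomial")

coef(fit)["grpM"]       # log-odds difference between groups M and F
exp(coef(fit)["grpM"])  # odds ratio: (40/60) / (20/80) = 8/3
```

Here the fitted odds ratio equals the ratio of the sample odds, about 2.67, analogous to the 2.52 seen for sexM above.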
Next, we fit a new model using only the variable "sex", which was the only significant one, and use it to predict probabilities of high use; we then compare the predictions to the actual values to investigate the predictive power of the model.
new.mod <- glm(high_use ~ sex, data = alc_data, family = "binomial")
probabilities <- predict(new.mod, type = "response")
# Add a column for probabilities, use it to make predictions of high use, and show the last 10 values
alc_data <- mutate(alc_data, probability = probabilities)
alc_data <- mutate(alc_data, prediction = probability > 0.5)
select(alc_data, sex, high_use, probability, prediction) %>% tail(10)
## sex high_use probability prediction
## 361 M FALSE 0.4000000 FALSE
## 362 M FALSE 0.4000000 FALSE
## 363 M TRUE 0.4000000 FALSE
## 364 F FALSE 0.2102564 FALSE
## 365 F FALSE 0.2102564 FALSE
## 366 F FALSE 0.2102564 FALSE
## 367 F FALSE 0.2102564 FALSE
## 368 F FALSE 0.2102564 FALSE
## 369 M TRUE 0.4000000 FALSE
## 370 M TRUE 0.4000000 FALSE
#Make a table
table(high_use = alc_data$high_use, prediction = alc_data$prediction)
## prediction
## high_use FALSE
## FALSE 259
## TRUE 111
prop.table(table(high_use = alc_data$high_use, prediction = alc_data$prediction)) * 100
## prediction
## high_use FALSE
## FALSE 70
## TRUE 30
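The misclassification rate behind these tables can also be computed directly. Below is a sketch of a simple loss function (the toy call uses made-up numbers; `alc_data$high_use` and `alc_data$probability` are the columns defined above):

```r
# Share of observations where the predicted probability falls on the
# wrong side of 0.5 relative to the true class (0/1 or FALSE/TRUE)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# Toy example: only the third observation is misclassified
loss_func(class = c(1, 0, 1, 0), prob = c(0.9, 0.2, 0.3, 0.1))  # 0.25
```

Called as `loss_func(alc_data$high_use, alc_data$probability)`, this would return the share of misclassified students; from the table above that is 111/370 = 0.3.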
The model correctly predicted FALSE for the 259 observations that truly were not high use, but it mistakenly predicted FALSE for all 111 observations that actually were high use. Because the predicted probability never exceeds 0.5 for either sex, the model never predicts high use at all, and its 70% accuracy simply reflects the share of non-high-use students. I would not use this model, as its predictive power is subpar.

***
Here, we use the package "MASS" to load the dataset "Boston", which contains information about housing values in the suburbs of Boston. The data frame has 506 rows and 14 columns and contains numeric and integer variables.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
pairs(Boston)
Here we show a graphical overview of the data and summaries of the variables. We want to use the data in long format, and for this we use melt(); we then examine the distributions of the variables and the correlations between them.
We see that most of the variables are not normally distributed; rm seems to be the most normally distributed variable. Some variables have two-peaked (bimodal) distributions (e.g. indus), some are left-skewed (e.g. ptratio) and others right-skewed (e.g. dis).
From the correlation plot we can see that lstat and medv, age and dis, nox and dis, and indus and dis are all strongly negatively correlated, while rad and tax are strongly positively correlated. The coloring can feel counterintuitive: according to the legend, red denotes -1 (negative correlation) and blue +1 (positive correlation).
library(tidyr)
library(corrplot)
## corrplot 0.92 loaded
library(ggplot2)
library(dplyr)
require(reshape2)
## Loading required package: reshape2
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
data_boston_long <- melt(Boston)
## No id variables; using all as measure variables
ggplot(data = data_boston_long, aes(x = value)) +
stat_density() +
facet_wrap(~variable, scales = "free")
mat_Boston <- cor(Boston) %>% round(2)
corrplot(mat_Boston, method = "circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
We then scale the data; observing the summary of the new data, we see that the means of all variables are now 0, whereas before they ranged from about 0.07 (chas) to 408 (tax). Then we create a categorical crime-rate variable and add it to the data. Finally, we divide the data into train and test sets.
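`scale()` simply standardizes each column by subtracting the column mean and dividing by the standard deviation; a quick sanity check on a toy vector:

```r
# Manual standardization matches what scale() does column-wise
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)
auto <- as.numeric(scale(x))
all.equal(manual, auto)  # TRUE
```

This is why every column mean in the scaled summary below is exactly 0 (and every standard deviation 1).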
# center and standardize variables
scaled_data <- scale(Boston)
# print summaries of the scaled variables
summary(scaled_data) # note that all column means are now 0, as expected
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
# Object is matrix so change to df
scaled_dat_df <- as.data.frame(scaled_data)
# Create categorical "crime" variable and add it to the dataset
bins <- quantile(scaled_dat_df$crim)
crime <- cut(scaled_dat_df$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
scaled_dat_df <- dplyr::select(scaled_dat_df, -crim)
scaled_dat_df <- data.frame(scaled_dat_df, crime)
#Create train and test sets of the data such that 80% of the data is in the train set
n <- nrow(scaled_dat_df)
ind <- sample(n, size = n * 0.8)
train <- scaled_dat_df[ind,]
test <- scaled_dat_df[-ind,]
Next, we fit a linear discriminant analysis (LDA) on the train set. We use the created categorical crime rate as the target variable and all the other variables in the dataset as predictor variables, and then we draw a bi-plot.
#Fit lda on train set
lda.fit <- lda(crime ~ . , data = train)
# Draw bi-plot
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
graphics::arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2)
lda.arrows(lda.fit, myscale = 1)
We used the test data to predict crime-rate classes and then compared them to the true classes. In the table we see that the analysis performs well for the low and high crime-rate categories; in particular, every observation in the high category was predicted correctly, but performance is poorer for the med_low and med_high categories.
# Save the crime categories from the test data set and then drop the variable from the test df
test_classes <- test$crime
test <- dplyr::select(test, -crime)
#Predict classes with test data and cross tabulate to compare with the real classes
lda.prediction <- predict(lda.fit, newdata = test)
table(correct = test_classes, predicted = lda.prediction$class)
## predicted
## correct low med_low med_high high
## low 15 12 2 0
## med_low 7 21 4 0
## med_high 1 4 14 1
## high 0 0 0 21
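The overall accuracy implied by a cross-tabulation like this is the diagonal sum divided by the total. A sketch using the counts printed above (note that the exact counts change between runs, since the train/test split is random):

```r
# Confusion matrix copied from the printed table above
conf <- matrix(c(15, 12,  2,  0,
                  7, 21,  4,  0,
                  1,  4, 14,  1,
                  0,  0,  0, 21),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))

# Proportion of test observations classified correctly
sum(diag(conf)) / sum(conf)  # about 0.70
```

Per-class accuracies (diag(conf) / rowSums(conf)) would make the weakness in the middle categories explicit.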
We reload the data, calculate the Euclidean distances and run k-means. I would say the optimal number of clusters is 6, as this is where the total within-cluster sum of squares curve flattens off.
# Load and scale data, calculate Euclidean distances
data("Boston")
scaled_data <- scale(Boston)
euclidian_dist <- dist(scaled_data)
# Determine the number of clusters and run k means
set.seed(246)
k_max <- 20
twcss <- sapply(1:k_max, function(k){kmeans(scaled_data, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line')
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
km <- kmeans(scaled_data, centers = 6)
# Visualize results
pairs(scaled_data, col = km$cluster)
First we read in the data and explore its structure and dimensions, then we set country names as row names. After that we show a graphical overview and observe the distributions of and correlations between the variables.
#Read in data, explore, and make country col -> rownames
human <- read.csv("human_filtered.csv")
dim(human)
## [1] 155 10
# There is an extra column for row numbers, "X", so let's remove it
human$X <- NULL
str(human)
## 'data.frame': 155 obs. of 9 variables:
## $ country : chr "Norway" "Australia" "Switzerland" "Denmark" ...
## $ f_m_edu : num 1.007 0.997 0.983 0.989 0.969 ...
## $ f_m_labour : num 0.891 0.819 0.825 0.884 0.829 ...
## $ expected_edu : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ life_exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ gni : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ maternal_mortality: int 4 6 6 5 6 7 9 28 11 8 ...
## $ adolesc_birth : num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ repr_parl : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
rownames(human) <- human$country
human$country <- NULL
#Graphical overview
library(GGally)
ggpairs(human)
summary(human)
## f_m_edu f_m_labour expected_edu life_exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## gni maternal_mortality adolesc_birth repr_parl
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
Observing the results of ggpairs, we see that expected years of education (expected_edu) is the only variable following a normal distribution; the other variables are skewed. Life expectancy and expected years of education are strongly positively correlated; other positively correlated pairs of variables are expected_edu and gni, and maternal mortality ratio and adolescent birth rate.
Next we perform a principal component analysis (PCA), a form of dimensionality reduction that tells us how much of the variability in the data is explained by each principal component, and draw a biplot to visualize the results.
# PCA and summary
pca_human <- prcomp(human)
s <- summary(pca_human)
pca_pr <- round(1*s$importance[2, ], digits = 5)
print(pca_pr)
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 0.9999 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
pc_lab <- paste0(names(pca_pr), " (", 100*pca_pr, "%)")
# Biplot
biplot(pca_human, cex = c(1, 1), col = c("grey40", "blue"), xlab = pc_lab[1], ylab = pc_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
The PCA on the raw data shows that PC1, which is dominated by the single variable gni, explains essentially all (99.99%) of the variability in the data. This happens because gni has by far the largest absolute variance, and unscaled PCA is sensitive to the scales of the variables.
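As a sanity check of this interpretation, here is a small self-contained sketch (toy data, not the human dataset) of how a single large-variance variable dominates unscaled PCA:

```r
# Toy data: two variables on very different scales, like gni vs. the
# ratio variables in the human data.
set.seed(1)
toy <- data.frame(
  small = rnorm(100, sd = 1),     # variance around 1
  large = rnorm(100, sd = 1000)   # variance around 1e6
)
p_raw    <- prcomp(toy)                 # unscaled PCA
p_scaled <- prcomp(toy, scale. = TRUE)  # PCA on standardized variables

summary(p_raw)$importance[2, 1]     # PC1 proportion ~1: "large" dominates
summary(p_scaled)$importance[2, 1]  # PC1 proportion ~0.5: equal contribution
```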
Next we standardize the data and repeat the analysis.
# Scale data
human_scaled <- scale(human)
# Repeat PCA with scaled data
pca_human_s <- prcomp(human_scaled)
s_scaled <- summary(pca_human_s)
pca_pr_scaled <- round(1*s_scaled$importance[2, ], digits = 5)
print(pca_pr_scaled)
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 0.53605 0.16237 0.09571 0.07583 0.05477 0.03595 0.02634 0.01298
pc_lab_scaled <- paste0(names(pca_pr_scaled), " (", pca_pr_scaled, "%)")
# Biplot
biplot(pca_human_s, cex = c(0.8, 1), col = c("grey40", "blue"), xlab = pc_lab_scaled[1], ylab = pc_lab_scaled[2])
With the scaled data, PC1 explains 53% of the variability and PC2 explains 16%. This differs from the raw-data results, where PC1 alone appeared to explain essentially all of the variability (99%), driven entirely by gni (gross national income). After standardization, the female-to-male labour force ratio and the percentage of women in parliament load mainly on PC2 (the y axis), while the other variables explain the variance along PC1 (the x axis).
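For completeness, a quick self-contained check of what scale() does: each column of the result has mean 0 and standard deviation 1, which is why no single variable can dominate the scaled PCA.

```r
# scale() centers each column and divides it by its standard deviation
set.seed(2)
m <- matrix(rnorm(60, mean = 10, sd = 5), ncol = 3)
m_scaled <- scale(m)

colMeans(m_scaled)        # all numerically zero
apply(m_scaled, 2, sd)    # all exactly 1
```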
Now we continue with the tea dataset from the FactoMineR package. We explore the data and then perform a multiple correspondence analysis (MCA).
#Load data, explore structure and dimensions
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
## $ frequency : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300 36
#Visualize the data
library(ggplot2)
library(dplyr)
library(tidyr)
# select a subset of columns to visualize, since 36 variables are too many to plot at once
tea_cols <- dplyr::select(tea, Tea, How, sugar, where, healthy)
pivot_longer(tea_cols, cols = everything()) %>%
ggplot(aes(value)) + geom_bar() + facet_wrap("name", scales = "free") + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
Now we perform Multiple Correspondence Analysis (MCA) on the tea data.
# MCA
library(FactoMineR)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
mca_tea <- MCA(tea_cols, graph = FALSE)
summary(mca_tea)
##
## Call:
## MCA(X = tea_cols, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.276 0.264 0.240 0.203 0.202 0.181 0.167
## % of var. 15.336 14.659 13.327 11.265 11.198 10.080 9.279
## Cumulative % of var. 15.336 29.995 43.322 54.587 65.785 75.865 85.144
## Dim.8 Dim.9
## Variance 0.144 0.124
## % of var. 7.980 6.876
## Cumulative % of var. 93.124 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
## 1 | 0.126 0.019 0.014 | -0.175 0.039 0.027 | -0.296 0.122
## 2 | 0.622 0.467 0.221 | -0.280 0.099 0.045 | -0.461 0.295
## 3 | -0.064 0.005 0.007 | -0.286 0.104 0.136 | -0.261 0.095
## 4 | -0.411 0.204 0.267 | 0.105 0.014 0.017 | -0.205 0.058
## 5 | -0.417 0.210 0.177 | -0.168 0.036 0.029 | -0.361 0.181
## 6 | -0.064 0.005 0.007 | -0.286 0.104 0.136 | -0.261 0.095
## 7 | -0.064 0.005 0.007 | -0.286 0.104 0.136 | -0.261 0.095
## 8 | 0.622 0.467 0.221 | -0.280 0.099 0.045 | -0.461 0.295
## 9 | 0.150 0.027 0.011 | 0.536 0.362 0.137 | -0.195 0.053
## 10 | 0.889 0.955 0.507 | -0.150 0.028 0.014 | -0.078 0.008
## cos2
## 1 0.078 |
## 2 0.121 |
## 3 0.113 |
## 4 0.067 |
## 5 0.132 |
## 6 0.113 |
## 7 0.113 |
## 8 0.121 |
## 9 0.018 |
## 10 0.004 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2 v.test
## black | 1.053 19.828 0.363 10.422 | -0.309 1.788 0.031 -3.060
## Earl Grey | -0.355 5.888 0.228 -8.254 | 0.410 8.195 0.303 9.521
## green | -0.283 0.640 0.010 -1.723 | -1.704 24.216 0.359 -10.360
## alone | -0.229 2.473 0.098 -5.400 | -0.362 6.441 0.243 -8.520
## lemon | 0.088 0.062 0.001 0.535 | 1.161 11.230 0.166 7.055
## milk | 0.165 0.415 0.007 1.472 | 0.374 2.226 0.037 3.334
## other | 3.486 26.416 0.376 10.601 | 0.961 2.099 0.029 2.922
## No.sugar | 0.440 7.237 0.207 7.861 | -0.486 9.239 0.252 -8.684
## sugar | -0.470 7.736 0.207 -7.861 | 0.519 9.876 0.252 8.684
## chain store | -0.302 4.242 0.163 -6.973 | -0.207 2.079 0.076 -4.773
## Dim.3 ctr cos2 v.test
## black | -0.277 1.573 0.025 -2.736 |
## Earl Grey | -0.053 0.151 0.005 -1.231 |
## green | 0.930 7.937 0.107 5.655 |
## alone | -0.162 1.414 0.048 -3.806 |
## lemon | 1.931 34.198 0.461 11.739 |
## milk | -0.427 3.198 0.049 -3.810 |
## other | -0.589 0.868 0.011 -1.791 |
## No.sugar | -0.066 0.189 0.005 -1.183 |
## sugar | 0.071 0.202 0.005 1.183 |
## chain store | -0.432 9.939 0.331 -9.950 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.364 0.451 0.116 |
## How | 0.405 0.290 0.476 |
## sugar | 0.207 0.252 0.005 |
## where | 0.224 0.306 0.590 |
## healthy | 0.181 0.020 0.012 |
# Visualize
plot(mca_tea, invisible=c("ind"), graph.type = "classic", habillage = "quali")
fviz_mca_biplot(mca_tea,
repel = TRUE, # Avoid text overlapping (slow if many point)
ggtheme = theme_minimal())
Looking at the factor map, categories with positive coordinates on the first dimensions, such as black, No.sugar, other, lemon and milk, are the most closely associated with each other, while categories on the negative side, such as green, tea shop and chain store, are the least associated with them. The variance explained by dimension 1 and dimension 2 is 15.34% and 14.66%, respectively, so the first two dimensions capture only about 30% of the variability. No clear clusters appear on the factor map.
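The "% of var." row in the MCA summary is simply each eigenvalue divided by the sum of all eigenvalues. Using the eigenvalues printed above, the percentages can be reproduced (up to the rounding of the displayed eigenvalues):

```r
# Eigenvalues of the 9 MCA dimensions, as printed in the summary above
eig <- c(0.276, 0.264, 0.240, 0.203, 0.202, 0.181, 0.167, 0.144, 0.124)
pct <- 100 * eig / sum(eig)
round(pct, 2)          # close to the 15.34, 14.66, 13.33, ... shown above
round(cumsum(pct), 2)  # cumulative %, ending at 100
```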
Longitudinal data
Part 1: RATS
The RATS dataset comes from a study of three groups of rats fed different diets, where body weight was measured over time as an indicator of growth. The objective was to find out whether the different diets affected rat growth.
# Read in the data, convert ID and Group to factors, and convert to long format
library(dplyr)
library(ggplot2)
library(tidyr)
RATS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", header = TRUE, sep = '\t')
RATS$ID <- factor(RATS$ID)
RATS$Group <- factor(RATS$Group)
RATS_long <- pivot_longer(RATS, cols = -c(ID, Group),
names_to = "groups",
values_to = "wd") %>%
mutate(Time = as.integer(substr(groups, 3, 4))) %>%
arrange(Time)
Next we visualize the data. All rats tend to grow during the experiment. The initial body weights are highest in group 3 and lowest in group 1. There also seems to be an outlier in group 2, but let's check that later.
ggplot(RATS_long, aes(x = Time, y = wd, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times = 4)) +
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(RATS_long$wd), max(RATS_long$wd)))
There appears to be a tracking phenomenon: rats with higher weight at the start tend to gain more weight and thus stay heavier throughout the study. Ideally, the rats would have been split into groups with similar baseline weight distributions, rather than groups with large differences in starting weights. Could there be a biological reason, perhaps genetic, why some rats are heavier to begin with and thus more likely to gain weight?
Now we standardize the data at each time point in order to see the tracking phenomenon more clearly.
library(dplyr)
library(tidyr)
#Now we standardize and repeat the plot
RATS_long <- RATS_long %>%
group_by(Time) %>%
mutate(stdwd = as.vector(scale(wd))) %>% # scale() returns a matrix; as.vector() keeps a plain column
ungroup()
ggplot(RATS_long, aes(x = Time, y = stdwd, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times = 4)) +
facet_grid(. ~ Group, labeller = label_both) +
scale_y_continuous(name = "standardized wd")
Let's now plot the means and standard deviations for each treatment group at each time point. The line plot is difficult to interpret because the baseline weights differ so much, but on average, group 2 rats seem to have gained the most weight over the course of the diet. However, group 2 also has the largest standard deviations, possibly due to the suspected outlier (rat number 12); since there are so few rats in that group, let's leave it for now.
RATSS <- RATS_long %>%
  group_by(Group, Time) %>%
  summarise(mean = mean(wd), se = sd(wd)) %>% # note: "se" here holds the standard deviation, not the standard error
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, colour = Group)) +
geom_line() +
scale_linetype_manual(values = c(1,2,3)) +
geom_point(size=3) +
scale_shape_manual(values = c(1,2,3)) +
geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
theme(legend.position = c(0.8,0.4)) +
scale_y_continuous(name = "mean rat weight +/- sd(rat weight)")
Now we calculate a summary measure (the mean weight per rat, excluding the baseline measurement) and draw boxplots of it. The boxplots show that in addition to the outlier in group 2 there are also outliers in groups 1 and 3. Since the group 2 outlier is much more obvious than the others, and easy to remove, let's remove only it.
RATS_filt <- RATS_long %>%
filter(Time > 1) %>%
group_by(Group, ID) %>%
summarise(mean = mean(wd)) %>%
ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
ggplot(RATS_filt, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape = 23, size = 4, fill = "white") +
scale_y_continuous(name = "mean(wd), weeks 1-9")
RATS_filt <- RATS_filt %>% filter(mean < 580)
Based on the visualisation there seems to be a difference between the diets, but we need to confirm this with a statistical test. Based on pairwise t-tests, all group comparisons show significant differences. However, I would not put much trust in this result: there were very few rats per group, outliers were present, and the baseline characteristics of the groups differed, which may reflect underlying differences in biology, genetics or metabolism.
t.test(mean ~ Group, data = filter(RATS_filt,Group!=3), var.equal = TRUE)
##
## Two Sample t-test
##
## data: mean by Group
## t = -25.399, df = 9, p-value = 1.094e-09
## alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
## 95 percent confidence interval:
## -204.0633 -170.6867
## sample estimates:
## mean in group 1 mean in group 2
## 265.025 452.400
t.test(mean ~ Group, data = filter(RATS_filt,Group!=2), var.equal = TRUE)
##
## Two Sample t-test
##
## data: mean by Group
## t = -27.824, df = 10, p-value = 8.345e-11
## alternative hypothesis: true difference in means between group 1 and group 3 is not equal to 0
## 95 percent confidence interval:
## -283.4943 -241.4557
## sample estimates:
## mean in group 1 mean in group 3
## 265.025 527.500
t.test(mean ~ Group, data = filter(RATS_filt,Group!=1), var.equal = TRUE)
##
## Two Sample t-test
##
## data: mean by Group
## t = -5.632, df = 5, p-value = 0.002446
## alternative hypothesis: true difference in means between group 2 and group 3 is not equal to 0
## 95 percent confidence interval:
## -109.37769 -40.82231
## sample estimates:
## mean in group 2 mean in group 3
## 452.4 527.5
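Running three separate t-tests inflates the chance of a false positive. A common alternative is a one-way ANOVA followed by adjusted pairwise comparisons; here is a sketch on simulated data with roughly the group means seen above (not the actual RATS summary data):

```r
# Simulated stand-in for the per-rat mean weights of the three groups
set.seed(42)
d <- data.frame(
  group = factor(rep(1:3, each = 8)),
  wt = c(rnorm(8, 265, 15), rnorm(8, 452, 15), rnorm(8, 527, 15))
)
anova(lm(wt ~ group, data = d))  # overall test for any group difference
# Pairwise comparisons with Bonferroni-adjusted p-values
pairwise.t.test(d$wt, d$group, p.adjust.method = "bonferroni")
```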
Part 2: BPRS
This dataset contains data on 40 subjects and their ratings on the brief psychiatric rating scale (BPRS, used to evaluate schizophrenia) before and during treatment, with the subjects split into two treatment groups. The scale assesses symptoms such as hallucinations on a scale from 1 (low) to 7 (high).
First we read in the data and convert treatment and subject to factors.
BPRS_long <- read.csv("/Users/nygrpetr/Desktop/iods/iods_petra/data/bprs_long.csv")
BPRS_long$treatment <- factor(BPRS_long$treatment)
BPRS_long$subject <- factor(BPRS_long$subject)
glimpse(BPRS_long)
## Rows: 360
## Columns: 6
## $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ subject <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ weeks <chr> "week0", "week0", "week0", "week0", "week0", "week0", "week0…
## $ bprs <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66, …
## $ week <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
Next we visualize the data. There may be an outlier in treatment group 2, but overall the two groups appear quite similar.
ggplot(BPRS_long, aes(x = week, y = bprs, linetype = subject)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ treatment, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(BPRS_long$bprs), max(BPRS_long$bprs)))
Next, we fit a linear regression model, ignoring the fact that the observations come from the same subjects and treating them as independent. With this model, treatment does not seem to have an effect, but time does have a significant effect.
fit <- lm(bprs ~ treatment + week, data = BPRS_long)
summary(fit)
##
## Call:
## lm(formula = bprs ~ treatment + week, data = BPRS_long)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.454 -8.965 -3.196 7.002 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.4539 1.3670 33.982 <2e-16 ***
## treatment2 0.5722 1.3034 0.439 0.661
## week -2.2704 0.2524 -8.995 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared: 0.1851, Adjusted R-squared: 0.1806
## F-statistic: 40.55 on 2 and 357 DF, p-value: < 2.2e-16
Next, because the observations are not actually independent, we fit a random intercept model with subject as the random effect.
#install.packages("lme4")
library(lme4)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
BPRS_random <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRS_long, REML = FALSE)
summary(BPRS_random)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
## Data: BPRS_long
##
## AIC BIC logLik deviance df.resid
## 2748.7 2768.1 -1369.4 2738.7 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0481 -0.6749 -0.1361 0.4813 3.4855
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 47.41 6.885
## Residual 104.21 10.208
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 1.9090 24.334
## week -2.2704 0.2084 -10.896
## treatment2 0.5722 1.0761 0.532
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.437
## treatment2 -0.282 0.000
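One useful quantity that can be read from the random-effects table above is the intraclass correlation (ICC): the share of the total variance attributable to between-subject differences. Using the printed variance estimates:

```r
# Variance components from the random intercept model above
subj_var  <- 47.41   # between-subject (intercept) variance
resid_var <- 104.21  # residual variance
icc <- subj_var / (subj_var + resid_var)
round(icc, 2)  # about 0.31: roughly a third of the variance is between subjects
```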
Next we try a random intercept and random slope model on the data
# create a random intercept and random slope model
BPRS_random1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRS_long, REML = FALSE)
summary(BPRS_random1)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
## Data: BPRS_long
##
## AIC BIC logLik deviance df.resid
## 2745.4 2772.6 -1365.7 2731.4 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8919 -0.6194 -0.0691 0.5531 3.7976
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.8222 8.0512
## week 0.9609 0.9802 -0.51
## Residual 97.4305 9.8707
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 2.1052 22.066
## week -2.2704 0.2977 -7.626
## treatment2 0.5722 1.0405 0.550
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.582
## treatment2 -0.247 0.000
Then we use anova (a likelihood-ratio test) to compare which model fits the data better.
anova(BPRS_random1, BPRS_random)
## Data: BPRS_long
## Models:
## BPRS_random: bprs ~ week + treatment + (1 | subject)
## BPRS_random1: bprs ~ week + treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRS_random 5 2748.7 2768.1 -1369.4 2738.7
## BPRS_random1 7 2745.4 2772.6 -1365.7 2731.4 7.2721 2 0.02636 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The random intercept and random slope model gives a significantly better fit (p = 0.026), so it describes the data better.
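The likelihood-ratio test in the anova table can be reproduced by hand from the printed deviances: the chi-square statistic is the drop in deviance, and the degrees of freedom is the difference in parameter counts (7 - 5 = 2). Up to the rounding of the displayed deviances:

```r
# Deviance drop between the two models, as printed above
chisq <- 2738.7 - 2731.4  # ~7.27 before display rounding
df    <- 7 - 5            # two extra variance/covariance parameters
pchisq(chisq, df = df, lower.tail = FALSE)  # ~0.026, matching Pr(>Chisq)
```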
Next we fit a random intercept and random slope model with an interaction between time and treatment, and again use anova to compare the models. The interaction does not significantly improve the fit (p = 0.075).
BPRS_random2 <- lmer(bprs ~ week + treatment + (week | subject) + week * treatment, data = BPRS_long, REML = FALSE)
summary(BPRS_random2)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject) + week * treatment
## Data: BPRS_long
##
## AIC BIC logLik deviance df.resid
## 2744.3 2775.4 -1364.1 2728.3 352
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0512 -0.6271 -0.0768 0.5288 3.9260
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.9964 8.0620
## week 0.9687 0.9842 -0.51
## Residual 96.4707 9.8220
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 47.8856 2.2521 21.262
## week -2.6283 0.3589 -7.323
## treatment2 -2.2911 1.9090 -1.200
## week:treatment2 0.7158 0.4010 1.785
##
## Correlation of Fixed Effects:
## (Intr) week trtmn2
## week -0.650
## treatment2 -0.424 0.469
## wek:trtmnt2 0.356 -0.559 -0.840
anova(BPRS_random2, BPRS_random1)
## Data: BPRS_long
## Models:
## BPRS_random1: bprs ~ week + treatment + (week | subject)
## BPRS_random2: bprs ~ week + treatment + (week | subject) + week * treatment
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRS_random1 7 2745.4 2772.6 -1365.7 2731.4
## BPRS_random2 8 2744.3 2775.4 -1364.1 2728.3 3.1712 1 0.07495 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now let's plot the observed and fitted values. The fitted values show that the BPRS scores of both groups decrease with time, so both treatments appear to reduce schizophrenic symptoms.
BPRS_long$fit <- fitted(BPRS_random2)
ggplot(BPRS_long, aes(x = week, y = bprs, group = subject)) +
geom_line(aes(color = treatment)) +
scale_x_continuous(name = "Time", breaks = seq(0, 8, 2)) +
scale_y_continuous(name = "BPRS") +
theme(legend.position = "top")
ggplot(BPRS_long, aes(x = week, y = fit, group = subject)) +
geom_line(aes(color = treatment)) +
scale_x_continuous(name = "Time", breaks = seq(0, 8, 2)) +
scale_y_continuous(name = "Fitted BPRS") +
theme(legend.position = "top")